INTENT CLASSIFICATION

Importing Libraries

Reading JSON files from hardcoded file paths

Reading JSON files from Config File

EDA

The train dataframe contains 150 intents, each with 100 occurrences.

The validation dataframe contains 150 intents, each with 20 occurrences.

The test dataframe contains 150 intents, each with 30 occurrences.
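The per-intent counts above can be verified with a quick `value_counts` check. A minimal sketch on a toy dataframe (the column names `text` and `intent` are assumptions; the real train dataframe has 150 intents with 100 rows each):

```python
import pandas as pd

# Toy stand-in for the train dataframe.
train_df = pd.DataFrame({
    "text": ["book a car", "rent a suv", "send money", "wire 50 dollars"],
    "intent": ["car_rental", "car_rental", "transfer", "transfer"],
})

counts = train_df["intent"].value_counts()   # occurrences per intent
n_intents = train_df["intent"].nunique()     # number of distinct intents
print(counts.to_dict(), n_intents)
```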

CHECKING IF DATA HAS EMAILS, URLs, MENTIONS and HASHTAGS IN THEM
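A presence check for these four noise patterns can be sketched with simple regexes (these are quick heuristics for flagging, not full validators):

```python
import re

# Lightweight patterns for a quick presence check.
PATTERNS = {
    "email":   re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "url":     re.compile(r"https?://\S+|www\.\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}

def noise_flags(text):
    """Return which of the four patterns occur in `text`."""
    return {name: bool(p.search(text)) for name, p in PATTERNS.items()}

print(noise_flags("email me at a.b@mail.com or see https://x.io #help @bot"))
print(noise_flags("book a table for two"))
```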

1. LINGUISTIC ANALYSIS

1. WORD FREQUENCIES

DATA PRE-PROCESSING ON TRAIN SET

DATA PRE-PROCESSING ON VALIDATION SET

DATA PRE-PROCESSING ON TEST SET

1.1 WORD FREQUENCIES PER INTENTS (6 random intents)

NOTE: The n-gram plots are drawn with the plotly package. Plotly has a known issue with offline plots: they are not re-rendered once the notebook is closed and re-opened. For reference, an HTML export of this entire ipynb is provided, in which the interactive plots remain accessible.

2. NGRAMS

2.1 Bigrams on the full Corpus

2.2 Trigrams on the full Corpus

2.3 Four-grams on the full Corpus

2.4 BIGRAMS FOR INTENT -> "car_rental"

2.5 TRIGRAMS FOR INTENT -> "transfer"

2.6 FOUR-GRAMS FOR INTENT -> "next_holiday"

3. WORDCLOUD

3.1 WORDCLOUD PER SPECIFIC INTENT (2 random intents)

SAVING TRAIN, VAL and TEST DATAFRAMES TO CSV FILES

FILTERING THE DATAFRAME ON THREE RANDOM INTENTS

FILTERING ON TRAIN SET

FILTERING ON VALIDATION SET

FILTERING ON TEST SET
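The same filter is applied to each split; a minimal sketch with `isin` (the three intent names below are hypothetical stand-ins for the randomly sampled trio):

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["rent a car", "wire money", "book flight", "where is my bag"],
    "intent": ["car_rental", "transfer", "book_flight", "lost_luggage"],
})

# Hypothetical trio; the notebook samples three intents at random.
keep = ["car_rental", "transfer", "book_flight"]
filtered = df[df["intent"].isin(keep)].reset_index(drop=True)
print(sorted(filtered["intent"].unique()))
```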

LINGUISTIC ANALYSIS FOR THESE THREE INTENTS

WORD FREQUENCIES

NGRAMS

WORDCLOUD

2. MODELLING WITH DIFFERENT WORD EMBEDDINGS

STATIC WORD EMBEDDINGS

1. Word2vec (SKIPGRAM)

To represent an utterance, we have to combine its word vectors (the vector representation of each word) into a single vector that represents the document as a whole. Here we compute a weighted average, where each weight reflects how important the word is with respect to the corpus; a natural choice for such a weight is the word's tf-idf score.

Word2Vec SKIPGRAM on THREE INTENTS DATAFRAMES

2. Count Vectorizer

Count Vectorizer on THREE INTENTS DATAFRAMES

CONTEXTUAL WORD EMBEDDINGS

1. SENTENCE BERT

SENTENCE BERT ON THREE INTENTS DATAFRAMES

MODEL BUILDING

1. RANDOM FOREST

RANDOM FOREST with Word2Vec (SKIPGRAM)

RANDOM FOREST with Count Vectorizer

RANDOM FOREST with SENTENCE BERT
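The three Random Forest runs above share one shape: embed the text, then fit the classifier. A minimal sketch of the Count Vectorizer variant on toy utterances (the real notebook trains on the full train split and evaluates on validation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

X_train = ["rent a car", "i want to rent a suv",
           "transfer money to savings", "wire fifty dollars"]
y_train = ["car_rental", "car_rental", "transfer", "transfer"]

# Vectorize and classify in one pipeline so predict() accepts raw strings.
clf = make_pipeline(CountVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(X_train, y_train)
print(clf.predict(["rent a car please"]))
```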

2. LOGISTIC REGRESSION

LOGISTIC REGRESSION with Word2Vec (SKIPGRAM)

LOGISTIC REGRESSION with Count Vectorizer

LOGISTIC REGRESSION with SENTENCE BERT
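Logistic regression plugs into the same features. With the dense document vectors (the averaged Word2Vec or Sentence-BERT embeddings), the sketch is as follows; two well-separated random clusters stand in for real embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in document embeddings: two separated clusters of 16-dim vectors.
X = np.vstack([rng.normal(0, 1, (20, 16)), rng.normal(4, 1, (20, 16))])
y = ["car_rental"] * 20 + ["transfer"] * 20

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```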

3. Sequence Labelling - BIO TAGGING

Reference

https://huggingface.co/transformers/v4.2.2/custom_datasets.html
https://huggingface.co/transformers/v4.2.2/custom_datasets.html#ft-native

Loading the tagged data
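As a reminder of the scheme the tagged data uses: each token is labelled B-&lt;type&gt; at the start of a span, I-&lt;type&gt; inside it, and O outside any span. A tiny illustration (the slot names `city` and `date` are hypothetical):

```python
# BIO tagging: B- starts a span, I- continues it, O is outside any span.
tokens = ["book", "a", "car", "in", "new", "york", "tomorrow"]
tags   = ["O", "O", "O", "O", "B-city", "I-city", "B-date"]

def extract_spans(tokens, tags):
    """Collect (label, text) spans from a BIO-tagged token sequence."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

print(extract_spans(tokens, tags))
```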

Simple BERT

DistilBERT

TRAINING MODE

EVAL MODE